Sentiment Analysis (Sentiment Analysis on Movie Reviews)

An exploration of the application of machine learning techniques to the sentiment analysis of movie reviews.

by Lucas Hure and Ian Smith

Overview

This project is an exploration of different machine learning techniques applied to Sentiment Analysis in the context of movie reviews. Sentiment Analysis is the task of building a model that analyzes and predicts the sentiment expressed in a piece of text using natural language processing and text analysis techniques.

The dataset we used was provided by the website Kaggle and is a collection of movie reviews from Rotten Tomatoes. Typically, sentiment classification is done on a binary scale of "positive" or "negative". One reason we chose this specific dataset is that its reviews are rated on a five-point scale:

0 - negative, 1 - somewhat negative, 2 - neutral, 3 - somewhat positive, 4 - positive

This makes the classification task much more granular and therefore much more challenging, with an accuracy expectation of 20% for a fully random model, as opposed to 50% for a dataset with a binarized sentiment scale. Comparatively little work has been done on classification with this kind of fine-grained scale, and this was a major motivation for us. We were also very interested in Natural Language Processing and in seeing how certain applications of NLP would affect the accuracy of our models.

The first part of this project comprises Data Exploration, where we look at the distribution of sentiment classes, the correlation between negation and sentiment, the relationship between phrase length and sentiment, and the ways in which sentiments are mispredicted (i.e., which class a mislabeled phrase tends to be assigned to).

Throughout the rest of our project, we tried to achieve a balance between optimizing the performance of a certain model and exploring new ones. Specifically, we started by engineering our own naive model "from the ground up" as a baseline for exploration and improvement. We then moved on to using a Naive Bayes model, a Logistic Regression model and Support Vector Machines.

How to Access Project Data and Sources

All of the project's data and code can be found in the following public GitHub repository: https://github.com/DryingPole/sentiment

The repository has three main directories:

notebooks: This folder contains all of the IPython notebooks used for our project. All notebooks except Final Project Process Book.ipynb should be ignored, as they only contain works in progress that were later migrated to the final notebook. Final Project Process Book.ipynb is the notebook that is uploaded to iSites.

resources: This directory contains all of the data used for the project. This includes primarily train.tsv and test.tsv, which were obtained from the Kaggle site.

sentiment: This directory contains a few Python files. The primary Python files of interest are core.py and bow.py. core.py centralizes a number of helper functions used throughout the notebook. bow.py includes some of our initial models, including the word-dictionary based model and an ensemble model.

Inspiration

Kaggle competition Website: http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews


In [1]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
%load_ext autoreload
%autoreload 2

# load required modules
import requests 
import StringIO
import zipfile
import numpy as np
import pandas as pd 
import scipy as sp
import matplotlib.pyplot as plt 
import datetime as dt 
import random
import collections
import re

import seaborn as sns

# load custom modules
import imp
source_path = '../sentiment/'
core = imp.load_source('core', source_path + 'core.py')
bow = imp.load_source('bow', source_path + 'bow.py')
bayes = imp.load_source('bayes', source_path + 'bayes.py')
None

Hands On: Exploring the Data Set


In [2]:
# use 'load_reviews' to load the data set into a data frame. This will "normalize" the 
# headings and make all phrases lower case. 
phrases = core.load_reviews('../resources/train.tsv')
phrases.head()


Out[2]:
phraseid sentenceid phrase sentiment
0 1 1 a series of escapades demonstrating the adage ... 1
1 2 1 a series of escapades demonstrating the adage ... 2
2 3 1 a series 2
3 4 1 a 2
4 5 1 series 2

Analyzing the Distribution of Sentiment Classes in the Data Set

We can see from the below analysis that the sentiment classes follow a roughly normal distribution, with the majority of samples classified as 2 (or "neutral"). More extreme positive and negative sentiments are rarer, with extremely negative sentiments making up the smallest part of the data set at only 4.5%.


In [3]:
# Get the total counts for each sentiment class within the data set
sent_counts = phrases[['sentiment']].groupby(by='sentiment').size().to_frame(name='Occurences')
sent_dist = sent_counts.divide(phrases.shape[0]).rename(columns={'Occurences': 'Distribution'})
sent_counts.merge(sent_dist, left_index=True, right_index=True)


Out[3]:
Occurences Distribution
sentiment
0 7072 0.045316
1 27273 0.174760
2 79582 0.509945
3 32927 0.210989
4 9206 0.058990

In [4]:
import math
import matplotlib.mlab as mlab

# Distribution of each class within the data set
mew, std = phrases.sentiment.mean(), phrases.sentiment.std()
xs = np.linspace(-1, 5, 200)
sent_dist.plot(kind='bar', title='Distribution of Sentiment Class vs. Normal Distribution', figsize=(14,8), legend=None)
plt.plot(xs, mlab.normpdf(xs, mew, std), c='r')
plt.xlabel('Sentiment Class')
plt.ylabel('Overall Percent of Data Set')
None


Next we wanted to get a sense for how strongly correlated negative reviews are with certain "fundamental" negative or negation words. For this purpose, we defined a tiny dictionary of "negation" words. Our theory is that identifying such a correlation might help us recognize negative reviews more readily, since it is a well-known problem that naive sentiment detection methods often fail to handle these constructs.


In [5]:
# Data Exploration -- capture some features of each phrase
import core

def load_negation_dict(path='../resources/neg_words.csv'):
    ndf = pd.read_csv(path, names=['neg_word'], index_col=0)
    ndf['Value'] = True
    return ndf.to_dict()['Value']

n_words = load_negation_dict()
pe = phrases.copy()
pe['word_list'] = pe.phrase.str.split()
pe['neg_count'] = map(lambda ws: core.lreduce(lambda acc, w: acc + (1 if w in n_words else 0), ws, 0), pe.word_list)
pe['contains_neg'] = map(lambda nc: True if nc > 0 else False, pe.neg_count)
pe['word_count'] = map(len, pe.word_list)
pe.head()


Out[5]:
phraseid sentenceid phrase sentiment word_list neg_count contains_neg word_count
0 1 1 a series of escapades demonstrating the adage ... 1 [a, series, of, escapades, demonstrating, the,... 1 True 37
1 2 1 a series of escapades demonstrating the adage ... 2 [a, series, of, escapades, demonstrating, the,... 0 False 14
2 3 1 a series 2 [a, series] 0 False 2
3 4 1 a 2 [a] 0 False 1
4 5 1 series 2 [series] 0 False 1

In [12]:
# Analyze phrase sentiment as related to the appearance of "negation" words
g = pe[pe['contains_neg'] == True].groupby('sentiment').count()
prop_of_neg = sent_counts.merge(g, left_index=True , right_index=True)
prop_of_neg['prop'] = prop_of_neg.neg_count / prop_of_neg.Occurences
prop_of_neg.plot(kind='bar', y='prop', figsize=(12,8))
plt.title('Proportion of Samples Containing "Negation" Words by Sentiment Class')
plt.xlabel('Sentiment Class')
plt.ylabel('Percentage of Samples Containing "Negation" Words')
None


We found that even when using a very small dictionary of "negation" words, these words occurred almost twice as frequently in the "negative" sentiment classes as in the neutral and positive sentiment classes. However, we realized that our method for tokenizing the phrases into words might not be entirely correct since it splits phrases on spaces. To see the kind of issue one might encounter, consider the following examples.

Parsing Phrases - Some Fine Tuning

Consider the example phrase "We considered parsing, but it was too difficult." What we find is that the appearance of punctuation in phrases can have the undesired effect of obscuring the words with which it co-occurs. This means that our models may not recognize potentially sentiment-significant words if they coincide with punctuation.


In [13]:
sample_phrase = "We considered parsing, but it was too difficult."
naive_split = sample_phrase.split()
better_split = re.split("\W+", sample_phrase)
print naive_split
print better_split


['We', 'considered', 'parsing,', 'but', 'it', 'was', 'too', 'difficult.']
['We', 'considered', 'parsing', 'but', 'it', 'was', 'too', 'difficult', '']

Normal space-based splitting yields:

['We', 'considered', 'parsing,', 'but', 'it', 'was', 'too', 'difficult.']

Improved word-based splitting yields:

['We', 'considered', 'parsing', 'but', 'it', 'was', 'too', 'difficult', '']


Next, we wanted to determine whether a meaningful heuristic could be extracted from the data to establish a correlation between the length of a phrase and its sentiment class.


In [14]:
# Determine the max word count to establish an appropriate range for bucketing phrases by number of words
print "Longest Phrase: ", max(pe.word_count)

def filter_by_wc(wc):
    """
    Return the sentiment labels of all phrases whose word count falls in the
    range (wc - 10, wc].
    """
    return pe[['sentiment']][(pe.word_count <= wc) & (pe.word_count > (wc-10))].sentiment.tolist()

bins = [filter_by_wc(x) for x in np.arange(10,60,10)]

plt.figure(figsize=(14,14))
plt.title("Distribution of Sentiment Ratings within Phrases")
plt.xlabel("Words per Phrase")
plt.ylabel("Distribution of Sentiment Labels")
sns.violinplot(bins, names = [str(x)+" to "+str(y) for x, y in zip(np.arange(0,50,10), np.arange(10,60,10))])
None


Longest Phrase:  52

We can see that as phrase length increases, the distribution of classes becomes less normal. Fewer phrases are rated 2s and more phrases receive either positive or negative ratings. We suspect that this is because longer phrases tend to provide more overall context for a review. If you look at many of the shorter phrases in the data set, they are in fact sub-phrases of the longer phrases. Many of these shorter phrases on their own lack sufficient context or the appearance of "strong sentiment" words needed to give the phrase definite polarity.

While we are not immediately sure how to exploit this correlation in terms of detecting a phrase's sentiment, we think it may help us make more effective choices of the phrases we use for training our models later on. Alternatively, we may be able to augment some of our models with evidence-based heuristics, such as the length-based feature sketched below, to provide some improvement.
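
As an illustration only (we did not build this into the models below), one way to expose phrase length to the vectorized models that appear later in the notebook would be to append the word count as an extra feature column. The helper name add_length_feature is ours, and the snippet assumes a sparse feature matrix X of the kind produced by CountVectorizer further on:

import numpy as np
from scipy import sparse

def add_length_feature(X, word_counts):
    # append the per-phrase word count as one extra sparse feature column
    length_col = sparse.csr_matrix(np.array(word_counts, dtype=float).reshape(-1, 1))
    return sparse.hstack([X, length_col]).tocsc()

# e.g. X_with_length = add_length_feature(X, pe.word_count.tolist())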


Conclusions from Data Exploration

We found it difficult to extract meaningful features from the data set using elementary analysis. Some of our initial assumptions about the data proved not to hold: the number of words in a review appears not to have any bearing on its overall sentiment, and the appearance of "negating" terms (such as "not", "neither", etc.) is more strongly correlated with neutral and negative ratings than with positive ratings, but the overall proportion of phrases in the data set containing negating terms is quite low, suggesting that this would be a difficult feature to exploit.

Models

An Initial Approach to Sentiment Analysis: Dictionary-Based Ranking Model

For our first model, we wanted to start with something extremely naive in order to gauge the difficulty of the problem. This first approach is quite similar to the "Bag of Words" model typically seen with Naive Bayes. That is, this initial model does not take phrase structure or part of speech into account at all. The approach is simple: glean from the training data set every one-word phrase and its associated sentiment score; once the dictionary has been built, parse each phrase into individual words and use a rudimentary algorithm to aggregate the rankings of a phrase's individual words into a holistic sentiment score. The idea behind this approach is to establish a baseline for how indicative individual word polarities are within a phrase. If we are lucky, the data set will have a sufficiently rich set of sentiment-laden words to allow this algorithm to perform moderately well.

With a balanced data set -- that is, one in which each class appears equally frequently -- we would expect randomly guessing each phrase's sentiment class to be roughly 20% accurate. Thus, if our naive algorithm can double or triple that figure, it will already represent a good deal of progress and give us an indication of how effective we can expect such naive "bag of words" approaches to be. Note that this fine-grained sentiment classification represents a significantly more difficult task than a typical "binary" classification scheme. As such, predicting classes with more than 60% accuracy is actually quite good; in fact the highest score currently on Kaggle is close to 75%.

As a first attempt, we define a model called 'BagOfWordsModel' in the module 'bow.py'. This model provides 'fit' and 'predict' methods; the 'fit' method expects a list of single-word phrases and a list of corresponding sentiments for those phrases. It builds a dictionary that is used by the 'predict' method to predict sentiments.

For our first attempt, we will split our training set into one-word phrases and multi-word phrases. Our one-word phrases will serve as our 'training' data set. After we've trained the model, we'll predict sentiments for the remaining phrases, then determine the model's accuracy by comparing against the known sentiment labels.

The exact method used for label prediction is quite simple: we take a weighted average of the sentiments of the individual words in a phrase and round the result to the nearest integer to arrive at the predicted class. Our model also optionally takes a dictionary of class weights to allow certain classes to contribute more to the overall prediction for a phrase. A minimal sketch of the idea is shown below.
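
The sketch below is only meant to illustrate the approach; the actual implementation lives in sentiment/bow.py and differs in its details (scoring bookkeeping, handling of unknown words, etc.). The class name SimpleBagOfWordsModel is ours, and the default class weights are the ones quoted later in the notebook ({0: 5, 1: 3, 2: 1, 3: 3, 4: 5}).

import numpy as np

class SimpleBagOfWordsModel(object):
    def fit(self, words, sentiments):
        # remember the sentiment observed for each single-word training phrase
        self.word_sentiments = dict(zip(words, sentiments))
        return self

    def predict(self, phrases, weight_map=None):
        weight_map = weight_map or {0: 5, 1: 3, 2: 1, 3: 3, 4: 5}
        preds = []
        for phrase in phrases:
            known = [self.word_sentiments[w] for w in phrase.split() if w in self.word_sentiments]
            if not known:
                preds.append(2)  # fall back to 'neutral' when no word is known
                continue
            weights = [weight_map[s] for s in known]
            # weighted average of the word sentiments, rounded to the nearest class
            preds.append(int(round(np.average(known, weights=weights))))
        return preds

    def score(self, phrases, sentiments, weight_map=None):
        preds = self.predict(phrases, weight_map)
        return np.mean([p == s for p, s in zip(preds, sentiments)])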


In [15]:
# use the one-word phrases as a training set for the BoW model. Split the data set into training data - one-word phrases -
# and test data - all other phrases

# Define the regex for single words and use this to split the data set
single_word = r'\A[\w-]+\Z'

# Extract the training data
train = phrases[phrases.phrase.str.match(single_word)][['phrase', 'sentiment']]
X_train, Y_train = train.phrase.tolist(), train.sentiment.tolist()

# Extract the test data
test = phrases[map(lambda x: not x, phrases.phrase.str.match(single_word))][['phrase', 'sentiment']]
X_test, Y_test = test.phrase, test.sentiment

# Show the head of the training and test data sets
print "Training Set:"
print train.head(5)
print "\nTest Set:"
print test.head(5)


Training Set:
           phrase  sentiment
3               a          2
4          series          2
6              of          2
8       escapades          2
11  demonstrating          2

Test Set:
                                              phrase  sentiment
0  a series of escapades demonstrating the adage ...          1
1  a series of escapades demonstrating the adage ...          2
2                                           a series          2
5  of escapades demonstrating the adage that what...          2
7  escapades demonstrating the adage that what is...          2

In [16]:
# Train the model, then predict and score sentiments for the test data
bow_model = bow.BagOfWordsModel()
bow_model.fit(X_train, Y_train)
accuracy = bow_model.score(X_test, Y_test)
print accuracy
print bow_model._scoring


0.509524729936
defaultdict(<function <lambda> at 0x12e450ed8>, {0: 227.0, 1: 4957.0, 2: 54876.0, 3: 10669.0, 4: 446.0})

Our first attempt with a naive algorithm gives us a 51% classification accuracy, which is more than two and a half times better than randomly guessing the classes.

This isn't bad for a first pass, but we want to get a better sense for which classes prove most difficult to predict. We can use our sent_counts data frame to determine our overall accuracy for each sentiment class.


In [17]:
acc_counts = bow_model._scoring
sent_count_dict = sent_counts.to_dict()['Occurences']

bow_acc_df = pd.DataFrame(data=np.transpose([acc_counts.keys(), 
                                             acc_counts.values(), 
                                             sent_count_dict.values()]),
                          columns=['sentiment', 'correct', 'total'])

bow_acc_df['accuracy'] = np.round(bow_acc_df.correct / bow_acc_df.total, 3)
print bow_acc_df[['sentiment', 'accuracy']].head()

bow_acc_df.plot(kind='bar', x='sentiment', y='accuracy', title='Predictive Accuracy by Class', figsize=(8, 6))
plt.ylabel('Accuracy')
plt.xlabel('Sentiment Class')
None


   sentiment  accuracy
0          0     0.032
1          1     0.182
2          2     0.690
3          3     0.324
4          4     0.048

As we can see from the above, our model performs exceptionally poorly on the most extreme sentiment classes of 0 and 4. Since our model drives off of the sentiments of individual words and allows us to adjust the weights assigned to each class, let's see if adjusting the default weights can improve our prediction accuracy.


In [19]:
# Define a function that searches a number of weight combinations and returns the weight combination and model that receives the
# best accuracy score.
def search_best_weights():
    best_acc = 0.0
    best_weights = None
    best_model = None
    zeros = [5, 25, 50, 100]
    ones = [2, 10, 20]
    threes = [3, 5]
    fours = [5, 25, 50]
    for z in zeros:
        for o in ones:
            for t in threes:
                for f in fours:
                    wm = {0: z, 1: o, 2: 1, 3: t, 4: f}
                    wbm = bow.BagOfWordsModel()
                    wbm.fit(X_train, Y_train)
                    wbm_acc = wbm.score(X_test, Y_test, weight_map=wm)
                    if wbm_acc > best_acc:
                        best_acc = wbm_acc
                        best_weights = wm
                        best_model = wbm
    return best_acc, best_model, best_weights

ba, bm, bw = search_best_weights()
print ba, bw


0.511765421758 {0: 25, 1: 2, 2: 1, 3: 3, 4: 5}

Based on this search, the best combination -- {0: 25, 1: 2, 2: 1, 3: 3, 4: 5} -- improves only marginally on our original default weights of {0: 5, 1: 3, 2: 1, 3: 3, 4: 5}. There doesn't seem to be much more to be gained from this model's approach, given that most other weight combinations only degraded performance.

Multinomial Naive Bayes

Multinomial Naive Bayes is often used to perform basic sentiment analysis in what is called a "bag of words" approach. To use this model, we'll use the CountVectorizer class of scikit-learn to convert our training data into a matrix in which each word represents a feature. The CountVectorizer will return a sparse matrix indicating the features -- or words -- that appear in each phrase.

Our research of sentiment-analysis methodologies indicates that word frequency tends not to provide a good indication of sentiment, so we'll use a binarized Multinomial Bayes approach in which each word-feature is counted only once per phrase. We'll also use a default stop-words list to try to pare down the feature list and improve accuracy.
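
As a quick toy illustration of what a binary CountVectorizer produces (the two phrases here are made up for the example and are separate from the pipeline below):

from sklearn.feature_extraction.text import CountVectorizer

toy_phrases = ["a truly great movie", "not a great movie"]
toy_vec = CountVectorizer(binary=True)
toy_X = toy_vec.fit_transform(toy_phrases)
print toy_vec.get_feature_names()   # the vocabulary, one feature per word
print toy_X.toarray()               # 0/1 indicators of which words appear in each phrase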


In [20]:
# Prepare the data -- use the vectorizer to return a feature set corresponding to all of the vocabulary found in the data set.
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

swords = 'english' # eliminate stop words
rarray = phrases.phrase.tolist()
vectorizer = CountVectorizer(binary=True, stop_words=swords)
vectorizer.fit(rarray)
X = vectorizer.transform(rarray).tocsc()
Y = phrases.sentiment.tolist()
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)

We'll define some helper functions to help us transform data and run our models. This will reduce the amount of boilerplate code we need to write to test our models.


In [21]:
def transform_data(reviews, **kwargs):
    """
    Vectorize a data frame of reviews into a sparse feature matrix and a list
    of sentiment labels. Keyword arguments are passed through to
    CountVectorizer (binary=True and min_df=0.0 by default); alternatively, a
    pre-built vectorizer can be supplied via the 'vectorizer' keyword.
    """
    if "binary" not in kwargs:
        kwargs["binary"] = True
    if "min_df" not in kwargs:
        kwargs["min_df"] = 0.0
    rarray = reviews.phrase.tolist()
    vectorizer = kwargs['vectorizer'] if 'vectorizer' in kwargs else CountVectorizer(**kwargs)
    vectorizer.fit(rarray)
    X = vectorizer.transform(rarray).tocsc()
    return X, reviews.sentiment.tolist()


def split(X, Y, **kwargs):
    """A thin wrapper around train_test_split."""
    return train_test_split(X, Y, **kwargs)


def test_model(model, data=None):
    """
    Fit a model on pre-split data and print its accuracy on the test split.

    Parameters
    ----------
    model: an instance of any model that provides 'fit' and 'score' methods
    data: a (xtrain, xtest, ytrain, ytest) tuple, e.g. as returned by split()

    Returns
    -------
    The fitted model.
    """
    xtrain, xtest, ytrain, ytest = data
    mf = model.fit(xtrain, ytrain)
    print "Model[%s] Accuracy: %0.2f%%" % (str(mf), mf.score(xtest, ytest))
    return mf

In [22]:
from sklearn.naive_bayes import MultinomialNB

X_trans, Y_trans = transform_data(phrases)
XY_all = split(X_trans, Y_trans)
mnb = test_model(MultinomialNB(), data=XY_all)


Model[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)] Accuracy: 0.61%

Our first naive model, using almost no data pre-processing, achieves an accuracy of 61% on the data set. Below we visualize a scatter plot of the expected vs. predicted classes to get a sense for which classes are most accurately predicted.


In [23]:
def scatter_exp_vs_pred(df=None, expected=None, predicted=None):
    """
    A function that takes a data frame with 'expected' and 'predicted' columns and
    creates a scatter plot for those columns
    :param df: A dataframe with 'expected' and 'predicted' columns, optionally None
    :param expected: A list of expected values
    :param predicted: A list of predicted values
    :return: None.
    """
    if df is None:
        df = pd.DataFrame()
        df['expected'] = expected
        df['predicted'] = predicted

    # res_df = pd.DataFrame(data=np.transpose([pred, XY_all[3]]), columns=["predicted", "expected"])
    rstats = df.groupby(["predicted", "expected"]).size().reset_index().rename(columns={0: "counts"})

    plt.figure(figsize=(12, 8))
    plt.title("Predicted vs. Expected Sentiment Values")
    plt.ylabel("Predicted Sentiment")
    plt.xlabel("Expected Sentiment")
    plt.scatter(rstats.expected, rstats.predicted, s=[0.1*c for c in rstats.counts], alpha=0.6)
    plt.show()
    return rstats

In [24]:
# Create a data frame of the expected vs. predicted sentiments. Count each predicted,expected pair and plot
pred = mnb.predict(XY_all[1])
rstats = scatter_exp_vs_pred(expected=XY_all[3], predicted=pred)


We can see from the above scatter plot that most of the miscategorizations fall in neighboring sentiment classes: 2s are generally miscategorized as 1s or 3s, 0s are mostly miscategorized as 1s, and most miscategorizations of 3s are either 2s or 4s. This is somewhat encouraging if we view the classification problem as an ordinal, regression-like problem; however, MNB does not model any ordering between the classes.


In [25]:
# Show exact figures for expected vs. predicted sentiment class.
rstats.head(25)


Out[25]:
predicted expected counts
0 0 0 520
1 0 1 449
2 0 2 179
3 0 3 21
4 0 4 5
5 1 0 851
6 1 1 3025
7 1 2 1915
8 1 3 337
9 1 4 35
10 2 0 393
11 2 1 2931
12 2 2 15415
13 2 3 3179
14 2 4 334
15 3 0 55
16 3 1 338
17 3 2 2203
18 3 3 4117
19 3 4 1189
20 4 0 4
21 4 1 22
22 4 2 186
23 4 3 593
24 4 4 719

The disadvantage of the above approach is that it doesn't provide any indication of whether the model is being overfit. To address this, we'll look at the respective accuracy scores of the model on both the training data and the test data.


In [26]:
xtrain, xtest, ytrain, ytest = XY_all
training_accuracy = mnb.score(xtrain, ytrain)
test_accuracy = mnb.score(xtest, ytest)

print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data:     %0.2f" % (test_accuracy)


Accuracy on training data: 0.67
Accuracy on test data:     0.61

It looks like there may be some slight overfitting. To get a better sense for the model's overall performance, we'll use cross-validation and see how the mean cross-validation score compares to the above score on the test data.


In [27]:
from sklearn.cross_validation import cross_val_score
r = cross_val_score(MultinomialNB(), xtrain, ytrain, cv=10) 
print
print "10-fold Cross Validation Scores: ", core.lreduce(lambda s, f: s + ("%0.2f, " % f), r, "")
print "Average Score: %0.2f" % r.mean()


10-fold Cross Validation Scores:  0.60, 0.60, 0.61, 0.61, 0.61, 0.61, 0.61, 0.60, 0.60, 0.60, 
Average Score: 0.61

Our model consistently achieves accuracy scores between 60 and 62%. We wanted to see if we could achieve any improvement in accuracy by performing a grid search over different parameters. The results of the grid search below show only the most marginal improvement with tuned parameters; it doesn't look like this will deliver the sort of improvements we desire.


In [28]:
alphas = [0.0, 0.5, 1.0, 5, 10, 20, 50]
min_df = [0.00001, 0.0001, 0.001, 0.005, 0.01]
best_sc = 0.0
best_mnb = None
best_params = None

for md in min_df: 
    x, y = transform_data(phrases, min_df=md)
    xtrain, xtest, ytrain, ytest = split(x, y)
    for a in alphas:
        clf = MultinomialNB(alpha=a).fit(xtrain, ytrain)
        sc = clf.score(xtest, ytest)
        if sc > best_sc:
            best_params = (a, md)
            best_sc = sc
            best_mnb = clf

In [29]:
print "Best Parameters: alpha:\t%0.2f\tmin_df:\t%0.5f" % (best_params[0], best_params[1])
print "Best Accuracy: ", best_sc


Best Parameters: alpha:	1.00	min_df:	0.00001
Best Accuracy:  0.60963731898

After obtaining only a marginal improvement from the parameter search, we wanted to try another model -- Logistic Regression -- to see if it might deliver better results.

Logistic Regression Models


In [33]:
from sklearn.linear_model import LogisticRegression

xtrain, xtest, ytrain, ytest = XY_all
clf = LogisticRegression().fit(xtrain, ytrain)

lreg_trnscore = clf.score(xtrain, ytrain)
lreg_tscore = clf.score(xtest, ytest)

print "Training accuracy %0.2f" % lreg_trnscore
print "Test accuracy %0.2f" % lreg_tscore


Training accuracy 0.70
Test accuracy 0.64
A cursory test using a logistic regression model shows a 3% improvement in performance. As before, we plot the expected vs. predicted classes below.

In [34]:
lreg_pred = clf.predict(xtest)
lreg_stats = scatter_exp_vs_pred(expected=ytest, predicted=lreg_pred)


To put these numbers into perspective, we join the lreg_stats data frame to the data frame that was returned for the MultinomialNB classifier. The 'comparison' column shows how the logistic regression classifications compare to those of the MNB classifier.


In [35]:
comp = lreg_stats.rename(columns={'counts':'LR_counts'}).merge(rstats, left_on=['predicted', 'expected'], right_on=['predicted', 'expected'])
comp['comparison'] = map(lambda x, y: float(x) / y, comp.LR_counts, comp.counts)
comp.head(25)


Out[35]:
predicted expected LR_counts counts comparison
0 0 0 490 520 0.942308
1 0 1 310 449 0.690423
2 0 2 75 179 0.418994
3 0 3 12 21 0.571429
4 0 4 3 5 0.600000
5 1 0 800 851 0.940071
6 1 1 2477 3025 0.818843
7 1 2 998 1915 0.521149
8 1 3 195 337 0.578635
9 1 4 20 35 0.571429
10 2 0 494 393 1.256997
11 2 1 3717 2931 1.268168
12 2 2 17708 15415 1.148751
13 2 3 4057 3179 1.276187
14 2 4 378 334 1.131737
15 3 0 34 55 0.618182
16 3 1 247 338 0.730769
17 3 2 1080 2203 0.490241
18 3 3 3512 4117 0.853048
19 3 4 1156 1189 0.972246
20 4 0 5 4 1.250000
21 4 1 14 22 0.636364
22 4 2 37 186 0.198925
23 4 3 471 593 0.794266
24 4 4 725 719 1.008345

In [36]:
outperform = comp[(comp.comparison > 1) & (comp.expected == comp.predicted)]
outperform.head()


Out[36]:
predicted expected LR_counts counts comparison
12 2 2 17708 15415 1.148751
24 4 4 725 719 1.008345

We see that logistic regression is doing better primarily as a result of guessing 2s much more accurately than MNB. Next, we want to see if we can tune our logistic regression model by testing different values of the regularization parameter C. For this we use a grid search over values between 2 and 5.


In [37]:
# use cross validation to find the optimal value for C
from sklearn.grid_search import GridSearchCV
c = np.arange(2, 5, 0.4)
param = {'C': c}
lreg = LogisticRegression()
clf2 = GridSearchCV(lreg, param, cv=3)
clf2.fit(xtrain, ytrain)


Out[37]:
GridSearchCV(cv=3,
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': array([ 2. ,  2.4,  2.8,  3.2,  3.6,  4. ,  4.4,  4.8])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [398]:
# visualize scores
a = clf2.grid_scores_
scores = [b.cv_validation_scores for b in a]

def plot_gs_results(scores):
    fig = plt.figure(figsize=(12, 12))

    # Add a boxplot of score per C value
    ax1 = fig.add_subplot(211)
    sns.boxplot(scores, ax=ax1)
    plt.title('Distribution of Accuracy Scores as a Function of $C$')
    plt.ylabel('Prediction Accuracy')
    plt.xticks(np.arange(1,9,1), [str(x) for x in param['C']])

    # Add a plot of the mean scores
    ax2 = fig.add_subplot(212)
    plt.title('Mean Accuracy Score as a Function of $C$')
    plt.xlabel('Choice of C')
    plt.ylabel('Prediction Accuracy')
    plt.scatter(xrange(1,9), np.mean(scores, axis=1), c='c', marker='o')
    plt.xticks(np.arange(1,9,1), [str(x) for x in param['C']])
    plt.show()
    None

plot_gs_results(scores)


Based on the plot above, we can see that $2.8$ provides the best value for $C$, although the difference between the values of $C$ is nearly negligible.


In [377]:
opti_clf = LogisticRegression(C = 2.8).fit(xtrain, ytrain)
print "Accuracy: %0.5f%%" % opti_clf.score(xtest,ytest)


Accuracy: 0.64188%

Support Vector Machines

Finally, we wanted to see how well an SVM model would do against this data set. SVM models, while less computationally efficient, are generally reported to perform well on sentiment analysis tasks.


In [399]:
from sklearn.svm import LinearSVC

svc = LinearSVC().fit(xtrain, ytrain)
svc_train_scr = svc.score(xtrain, ytrain)
svc_test_scr = svc.score(xtest, ytest)
print "SVC Accuracy (train) %0.4f" % svc_train_scr
print "SVC Accuracy (test) %0.4f" % svc_test_scr


SVC Accuracy (train) 0.7353
SVC Accuracy (test) 0.6364

In [407]:
svc_preds = svc.predict(xtest)
svc_stats = scatter_exp_vs_pred(expected=ytest, predicted=svc_preds)



In [409]:
comp = comp.merge(svc_stats.rename(columns={'counts': 'svc_counts'}), 
                   left_on=['predicted', 'expected'], right_on=['predicted', 'expected'])
comp['svc_comparison'] =  map(lambda x, y: float(x) / y, comp.svc_counts, comp.counts)
comp.head(25)


Out[409]:
predicted expected LR_counts counts comparison svc_counts svc_comparison
0 0 0 494 505 0.978218 643 1.273267
1 0 1 324 487 0.665298 509 1.045175
2 0 2 70 178 0.393258 134 0.752809
3 0 3 15 16 0.937500 20 1.250000
4 1 0 800 831 0.962696 792 0.953069
5 1 1 2452 2992 0.819519 2827 0.944853
6 1 2 1074 1927 0.557343 1461 0.758173
7 1 3 220 367 0.599455 255 0.694823
8 1 4 33 41 0.804878 23 0.560976
9 2 0 461 401 1.149626 321 0.800499
10 2 1 3733 2925 1.276239 3134 1.071453
11 2 2 17659 15385 1.147806 16800 1.091973
12 2 3 4049 3202 1.264522 3652 1.140537
13 2 4 363 316 1.148734 261 0.825949
14 3 0 41 59 0.694915 37 0.627119
15 3 1 232 328 0.707317 269 0.820122
16 3 2 1066 2251 0.473567 1434 0.637050
17 3 3 3523 4106 0.858013 3642 0.886995
18 3 4 1165 1186 0.982293 1105 0.931703
19 4 0 7 7 1.000000 10 1.428571
20 4 1 18 27 0.666667 20 0.740741
21 4 2 46 174 0.264368 86 0.494253
22 4 3 424 540 0.785185 662 1.225926
23 4 4 746 764 0.976440 917 1.200262

We can see here that SVC compares quite favorably to logistic regression and multinomial Bayes, with improvements in correctly guessing almost every category. Unfortunately, this method is computationally heavy, which makes additional analysis with it somewhat cumbersome.

Delving Deeper into Each Model's Predictive Accuracy

While the predictive accuracy of these models is actually not bad, it can be somewhat misleading. The issue is that the distribution of classes within the data set is far from uniform; the vast majority of samples are 2s. This means that any model with a high recall on class 2 will do fairly well against this data set, but we do not get an accurate sense of how the model would perform against an arbitrary sample. To assess each model's predictive abilities more accurately, then, we should test our models on a data set of balanced classes.
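
One quick way to make the per-class behaviour explicit is scikit-learn's classification_report, which breaks precision and recall out by class. A sketch, assuming the mnb model and the XY_all split from the cells above:

from sklearn.metrics import classification_report

xtrain, xtest, ytrain, ytest = XY_all
# per-class precision, recall and F1 for the Multinomial Naive Bayes model
print classification_report(ytest, mnb.predict(xtest))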

To begin, we'll create balanced test and training data sets and run our models against these. We should expect to see the predictive accuracy drop significantly, especially given earlier analysis that showed most of the models' accuracy coming from correctly predicting 2s.

Splitting the Data into Balanced Data Sets

Below we build a balanced subset of the data (a sketch of a reusable helper follows), then train our models and analyze the results as before.
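
A helper along these lines could encapsulate the balancing step; the function name balance_classes and the use of a fixed seed are ours, and the cells below carry out the same steps inline:

import numpy as np
import pandas as pd

def balance_classes(df, label_col='sentiment', seed=42):
    # down-sample every class to the size of the smallest class
    np.random.seed(seed)
    smallest = df.groupby(label_col).size().min()
    sampled = []
    for _, group in df.groupby(label_col):
        idx = np.random.choice(group.index, smallest, replace=False)
        sampled.append(df.loc[idx])
    return pd.concat(sampled)

# e.g. bal_reviews = balance_classes(phrases)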


In [384]:
# Create a data set with balanced classes. The class with the least number of samples is '0', with only 7072.
smallest = phrases.groupby('sentiment').count().min()[0]
print "Smallest Class Size: ", smallest


Smallest Class Size:  7072

In [401]:
zeroes = phrases[phrases.sentiment == 0]
ones = phrases[phrases.sentiment == 1]
twos = phrases[phrases.sentiment == 2]
threes = phrases[phrases.sentiment == 3]
fours = phrases[phrases.sentiment == 4]

ones_samples = ones.loc[np.random.choice(ones.index, smallest, replace=False)]
twos_samples = twos.loc[np.random.choice(twos.index, smallest, replace=False)]
threes_samples = threes.loc[np.random.choice(threes.index, smallest, replace=False)]
fours_samples = fours.loc[np.random.choice(fours.index, smallest, replace=False)]

bal_reviews = pd.concat([zeroes, ones_samples, twos_samples, threes_samples, fours_samples])

In [402]:
# Transform and Split the balanced data set.
XY_bal = transform_data(bal_reviews)
XY_bal_all = split(XY_bal[0], XY_bal[1])

In [411]:
bal_models = [test_model(m, XY_bal_all) for m in [MultinomialNB(alpha=1.0), LogisticRegression(C=2.8), LinearSVC()]]


Model[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)] Accuracy: 0.48%
Model[LogisticRegression(C=2.8, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)] Accuracy: 0.54%
Model[LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2,
     random_state=None, tol=0.0001, verbose=0)] Accuracy: 0.53%

In [412]:
ypreds = [m.predict(XY_bal_all[1]) for m in bal_models]
bal_stats = [scatter_exp_vs_pred(expected=XY_bal_all[3], predicted=p) for p in ypreds]


As expected, the predictive accuracy of all of the previously explored models declined by 10% or more. Balancing the data set has also yielded another interesting insight: each model now appears comparatively better at predicting the more polarized sentiment classes than at detecting the neutral phrases.

Binarizing the Data Set - Switching to Coarse-Grained Sentiment Analysis

Such fine-grained sentiment analysis is actually quite difficult. Many sentiment analysis tasks that appear in popular publications focus on coarser-grained classification: a phrase is categorized as either positive or negative.

We wanted to get a sense for how accurately we could predict sentiments when they fall into one of these two classifications. To do this, we created a "new" data set from the Kaggle data set by removing all neutral reviews (the 2s) and mapping the classes 0 and 1 to "negative" reviews, and 3 and 4 to "positive."


In [413]:
#create dataset of binarized sentiments to positive and negative.
bin_reviews = phrases[(phrases.sentiment != 2)]
sents = bin_reviews.sentiment.tolist()

bin_sents = []
for i in sents:
    if (i == 1) | (i == 0):
        bin_sents.append(0)
    else:
        bin_sents.append(1)
        
#create DF of binary sentiments
bin_reviews = bin_reviews.drop(['sentiment'], axis = 1)
bin_reviews['sentiment'] = bin_sents

In [414]:
bin_reviews.head()


Out[414]:
phraseid sentenceid phrase sentiment
0 1 1 a series of escapades demonstrating the adage ... 0
21 22 1 good for the goose 1
22 23 1 good 1
33 34 1 the gander , some of which occasionally amuses... 0
46 47 1 amuses 1

In [415]:
X_bin, Y_bin = transform_data(bin_reviews)
XY_bin_all = split(X_bin, Y_bin)
xbtrain, xbtest, ybtrain, ybtest = XY_bin_all
[test_model(m, XY_bin_all) for m in [MultinomialNB(alpha=1.0), LogisticRegression(C=2.8), LinearSVC()]]


Model[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)] Accuracy: 0.86%
Model[LogisticRegression(C=2.8, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)] Accuracy: 0.89%
Model[LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2,
     random_state=None, tol=0.0001, verbose=0)] Accuracy: 0.89%
Out[415]:
[MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 LogisticRegression(C=2.8, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001),
 LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
      intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
      random_state=None, tol=0.0001, verbose=0)]

As we can see, using this coarse-grained sentiment detection we obtain accuracies between 86% and 89%.

One of the persistent problems with sentiment analysis is dealing effectively with modifying words that negate or change the meaning of other words. One common technique, which we experiment with below, is to mark every word that follows a negation token (for example by prefixing it with "NOT_") so that negated words become distinct features.


In [430]:
def reviews_and_model(revs, model):
    """
    Vectorize the given reviews data frame, split it into train and test
    sets, fit the given model (any estimator providing 'fit' and 'score'),
    and print its accuracy on the test split.

    Parameters
    ----------
    revs: a reviews dataframe, e.g. the original "phrases" set, the
          "bal_reviews" dataset, or the "bin_reviews" dataset
    model: a model instance such as MultinomialNB(), LogisticRegression() or LinearSVC()
    """
    # vectorize dataset
    rarray = revs.phrase.tolist()
    vectorizer = CountVectorizer(min_df=0.00, binary=True)
    vectorizer.fit(rarray)
    X = vectorizer.transform(rarray).tocsc()

    # split data into train and test
    Y = revs.sentiment.tolist()
    xtrain, xtest, ytrain, ytest = train_test_split(X, Y)

    # run model and report accuracy on the held-out split
    clf = model.fit(xtrain, ytrain)

    print "Accuracy: %0.2f%%" % clf.score(xtest, ytest)

In [421]:
pe = bin_reviews.copy()
pe['word_list'] = pe.phrase.str.split()
pe['contain_neg'] = map(lambda l:"n't" in l or "not" in l, pe.word_list)
pe['word_count'] = map(len, pe.word_list)

# Analyze phrase sentiment as related to the appearance of negation words ("not" / "n't")
neg_counts = pe.groupby(["sentiment","contain_neg"]).size().to_frame("neg_count").reset_index()
print neg_counts


   sentiment contain_neg  neg_count
0          0       False      31121
1          0        True       3224
2          1       False      40620
3          1        True       1513

In [425]:
def neg_preprocess(revs_list):
    """
    Mark the scope of negation: every word that appears after a "not" or
    "n't" token in a phrase is prefixed with "NOT_", so that negated words
    become distinct features for the vectorizer.
    """
    new_phrases = []
    for phrase in revs_list:
        parts = re.split("[\s,\.\!\?\:]+", phrase)
        if "n't" in parts or "not" in parts:
            if "n't" in parts:
                ind = parts.index("n't")
            if "not" in parts:
                ind = parts.index("not")
            temp = []
            for i, p in enumerate(parts):
                if i <= ind:
                    temp.append(p)
                elif p != '':
                    temp.append("NOT_" + p)
            new_phrases.append(' '.join(temp))
        else:
            new_phrases.append(phrase)
    return new_phrases

In [426]:
neg_reviews = bin_reviews.copy()
lst = neg_reviews['phrase'].tolist()
neg_reviews = neg_reviews.drop(['phrase'], axis = 1)

In [427]:
def ultimate_neg(rev_list, sent_list):
    """
    Heuristic relabelling: any phrase containing "not" or "n't" is assigned
    the 'negative' label (0); all other phrases keep their original label.
    """
    new_phrs = []
    new_sents = []
    for phr, sen in zip(rev_list, sent_list):
        prts = re.split("[\s,\.\!\?\:]+", phr)
        if "n't" in prts or "not" in prts:
            new_phrs.append(phr)
            new_sents.append(0)
        else:
            new_phrs.append(phr)
            new_sents.append(sen)
    return new_phrs, new_sents

In [428]:
neg_neg = bin_reviews.copy()
neg_l = neg_neg.phrase.tolist()
neg_s = neg_neg.sentiment.tolist()
neg_neg = neg_neg.drop(['phrase'], axis = 1)
neg_neg = neg_neg.drop(['sentiment'], axis = 1)

In [431]:
neglist, negsent = ultimate_neg(neg_l,neg_s)
neg_neg['phrase'] = neglist
neg_neg['sentiment'] = negsent
reviews_and_model(neg_neg, LogisticRegression(C = 10))


Accuracy: 0.91%

Final Analysis

From this exploration of Sentiment Analysis techniques, we learned about some of the challenges of Natural Language Processing. Words are features, but on their own those features don't capture the subtleties of sentence structure, part of speech, and context. For example, one of the shortcomings of our models is that they use a Bag of Words approach, which doesn't take those subtleties into account; each word is a distinct feature.

We tried to combat those limitations by manipulating the data in different ways, namely:

  • We accounted, albeit in naive ways, for the presence of negation within sentences.
  • We used a stop words list to filter out some of the noise created by the presence of "filler" words such as articles and prepositions, in an effort to more easily extract the essence of sentiment contained in each phrase.
  • We tried using n-grams -- additional features fed to the model by aggregating sequences of n words into single features -- in order to give our models a certain notion of context (a brief sketch of such a configuration follows this list). This last attempt gave us the idea of coding certain heuristics into our models. For example, looking at the proportion of "negative" predictions in the subset of our binarized dataset containing negation, we realized that hard-coding a systematic "negative" classification in certain cases could yield a strategic gain in accuracy, which it did.
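
As an illustration of the n-gram idea, CountVectorizer can be asked to emit word pairs in addition to single words; this is a sketch of such a configuration, not necessarily the exact settings we used:

from sklearn.feature_extraction.text import CountVectorizer

# unigrams plus bigrams, so that a pair like "not good" becomes a feature of its own
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_ngrams = ngram_vectorizer.fit_transform(phrases.phrase.tolist()).tocsc()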

Because the dataset was very imbalanced in terms of class distribution, we decided to create a subset of our original set in which each class has a randomly chosen number of phrases equal to the size of the smallest class, thereby artificially creating a perfectly balanced dataset. In all, in addition to our initial dataset, we experimented with three other datasets prepared in different ways.

Using cross-validation to find optimal parameter settings, heuristic functions, and optimized data, we managed to push our accuracy to 65% on the 5-class dataset, which is over three times higher than random expectation, and to a surprising 91% on the binarized-sentiment dataset. The model that performed best on both the 5-class and binarized datasets was Logistic Regression, which slightly outperformed the Naive Bayes classifier. Looking for reasons behind that discrepancy, we noted that a Bayesian approach relies on the assumption that features are conditionally independent given the class. This is clearly not fully the case with the datasets we used, which probably affected the Naive Bayes classifier's performance, even though it still yielded very good results.
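
For reference, this is the conditional-independence assumption the multinomial Naive Bayes classifier relies on, written for a phrase made up of words $w_1, \dots, w_n$ and a sentiment class $c$:

$$P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$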

Further improvements could be made in the form of:

  • more sophisticated parsing of sentences that takes into account the actual syntactic and semantic structure of the text;
  • the inclusion in our models of more complex and precise heuristics;
  • ensemble methods. After visualizing the ways in which our models mispredict data, we realized that since their strengths and weaknesses in terms of class prediction were largely the same, there would be little value in using ensemble methods, which are predicated on combining different strengths and confidence levels across models. However, with more sophisticated models, it is quite possible that such discrepancies would start to appear and make ensemble methods more appealing.
